Data storage system manager and method for managing a data storage system

ABSTRACT

A data storage system manager includes one or more servers, at least one data collector deployed on at least one of the servers, at least one policy engine deployed on at least one of the servers, and at least one configuration manager deployed on at least one of the servers. The at least one data collector is configured to collect resource utilization information including data storage wear rate of data storage system data storage modules. The at least one policy engine is configured to evaluate the collected information and to initiate changes to a configuration of the data storage system based on data storage wear rate and work load distribution policies. The at least one configuration manager is configured to implement the changes initiated by the at least one policy engine to control the data storage wear rate and a skew of the work load distribution within the data storage system.

BACKGROUND

Copy-on-write (“COW”) is an optimization strategy used in computer programming. Multiple requesters of resources that are initially indistinguishable are given pointers to the same resource. This strategy is maintained until a requester attempts to modify its copy of the resource. A private copy is then created to prevent any changes from becoming visible to the other requesters. The creation of such private copies is transparent to the requesters. No private copy is created if a requester does not attempt to modify its copy of the resource.
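As a rough illustration of the concept (not part of the original disclosure), the following Python sketch hands out shared references to a resource and creates a private copy only on the first write; the class and method names are invented for this example.

    # Illustrative copy-on-write wrapper (hypothetical names and structure).
    class CowRef:
        """Hands out a shared resource and copies it only on the first write."""

        def __init__(self, resource):
            self._shared = resource      # resource shared by all requesters
            self._private = None         # private copy, created lazily

        def read(self):
            # Reads never trigger a copy; the shared resource is returned
            # until this requester has written.
            return self._private if self._private is not None else self._shared

        def write(self, key, value):
            # First write: create a private copy so other requesters
            # never observe this requester's changes.
            if self._private is None:
                self._private = dict(self._shared)
            self._private[key] = value


    shared = {"a": 1}
    r1, r2 = CowRef(shared), CowRef(shared)
    r1.write("a", 2)
    assert r2.read()["a"] == 1   # r2 still sees the original resource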

Virtual memory operating systems may use COW. If a process creates a copy of itself, pages in memory that may be modified by the process (or its copy) are marked COW. If one process modifies the memory, the operating system's kernel may intercept the operation and copy the memory so that changes in one process's memory are not visible to the other.

COW may also be used in the calloc function provided in the C and C++ standard libraries for performing dynamic memory allocation. A page of physical memory, for example, may be filled with zeroes. If the memory is allocated, the pages returned may all refer to the page of zeroes and may be marked as COW. As such, the amount of physical memory allocated for a process does not increase until data is written.

A memory management unit (MMU) may be instructed to treat certain pages in an address space of a process as read-only in order to implement COW. If data is written to these pages, the MMU may raise an exception to be handled by a kernel. The kernel may then allocate new space in physical memory and make the page being written correspond to that new location in physical memory.

COW may permit efficient use of memory. Physical memory usage only increases as data is stored in it.

Outside a kernel, COW may be used in library, application and system code. For example, the string class provided by the C++ standard library allows COW implementations. COW may also be used in virtualization/emulation software such as Bochs, QEMU and UML for virtual disk storage. This may (i) reduce required disk space as multiple virtual machines (VMs) may be based on the same hard disk image and (ii) increase performance as disk reads may be cached in RAM and subsequent reads served to other VMs outside of the cache.

COW may be used in the maintenance of instant snapshots on database servers. Instant snapshots preserve a static view of a database by storing a pre-modification copy of data when underlying data are updated. Instant snapshots are used for testing or moment-dependent reports. COW may also be used as the underlying mechanism for snapshots provided by logical volume management.

COW may be used to emulate a read-write storage on media that require wear leveling or are physically Write Once Read Many.

ZFS is a file system designed by Sun Microsystems for the Solaris Operating System. The features of ZFS may include support for high storage capacity, integration of the concepts of file system and volume management, snapshots and COW clones, on-line integrity checking and repair, and RAID-Z.

Unlike traditional file systems, which may reside on single devices and thus require a volume manager to use more than one device, ZFS file systems are built on top of virtual storage pools referred to as zpools. A zpool is constructed of virtual devices (vdevs), which are themselves constructed of block devices: files, hard drive partitions or entire drives.

Block devices within a vdev may be configured in different ways, depending on need and space available: non-redundantly (similar to RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z (similar to RAID 5 with regard to parity) group of three or more devices, or as a RAID-Z2 (similar to RAID 6 with regard to parity) group of four or more devices. The storage capacity of all vdevs may be available to all of the file system instances in the zpool.

ZFS uses a COW transactional object model. All block pointers within the file system may contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are not overwritten in place. Instead, a new block is allocated, modified data is written to it and then any metadata blocks referencing it are similarly read, reallocated and written. To reduce the overhead of this process, multiple updates may be grouped into transaction groups. An intent log may be used when synchronous write semantics are required.
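The following toy Python sketch is an illustration only (it models neither the ZFS on-disk format nor its transaction groups); it shows the two ideas above: blocks are never overwritten in place, and the checksum carried in a block pointer is verified on every read.

    import hashlib

    blocks = {}            # block id -> bytes (simulated storage)
    next_id = 0

    def alloc(data: bytes) -> tuple:
        """Never overwrite in place: allocate a fresh block and return
        a (block id, checksum) pointer to it."""
        global next_id
        bid = next_id
        next_id += 1
        blocks[bid] = data
        return bid, hashlib.sha256(data).hexdigest()

    def read(ptr: tuple) -> bytes:
        bid, checksum = ptr
        data = blocks[bid]
        # Verify the checksum stored in the pointer on every read.
        if hashlib.sha256(data).hexdigest() != checksum:
            raise IOError("checksum mismatch: block %d is corrupt" % bid)
        return data

    # Writing new data leaves the old block intact, so a snapshot that
    # still holds the old pointer continues to see the original contents.
    old_ptr = alloc(b"version 1")
    new_ptr = alloc(b"version 2")
    assert read(old_ptr) == b"version 1"
    assert read(new_ptr) == b"version 2"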

If ZFS writes new data, the blocks containing the old data may be retained, allowing a snapshot version of the file system to be maintained. ZFS snapshots may be created quickly, since all the data composing the snapshot is already stored. They may also be space efficient, since any unchanged data is shared among the file system and its snapshots.

Writeable snapshots (“clones”) may also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks may be created to reflect those changes. Any unchanged blocks continue to be shared, no matter how many clones exist.

ZFS employs dynamic striping across all devices to maximize throughput. As additional devices are added to the zpool, the stripe width automatically expands to include them. Thus all disks in a pool are used, which balances the write load across them.

ZFS uses variable-sized blocks of up to 128 kilobytes. Currently available code allows an administrator to tune the maximum block size used, as certain workloads may not perform well with large blocks.

If data compression is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve I/O throughput (though at the cost of increased CPU use for the compression and decompression operations).

In ZFS, file system manipulation within a storage pool may be less complex than volume manipulation within a traditional file system. For example, the time and effort required to create or resize a ZFS file system is closer to that of making a new directory than it is to volume manipulation in some other systems.

SUMMARY

A data storage system manager includes one or more servers including at least one data collector, at least one policy engine, and at least one configuration manager. The at least one data collector is configured to collect resource utilization information including data storage wear rate of data storage system data storage modules and a skew of work load distribution within the data storage system. The at least one policy engine is configured to evaluate the collected information and initiate changes to a configuration of the data storage system based on wear rate and work load distribution policies. The policies specify one of a maximum data storage wear rate that depends on the skew of the work load distribution, and a maximum skew for the work load distribution that depends on the data storage wear rate. The at least one configuration manager is configured to implement the changes initiated by the at least one policy engine to control the data storage wear rate and skew of the work load distribution.

A data storage system manager includes one or more servers, at least one data collector deployed on at least one of the servers, at least one policy engine deployed on at least one of the servers, and at least one configuration manager deployed on at least one of the servers. The at least one data collector is configured to collect resource utilization information including data storage wear rate of data storage system data storage modules. The at least one policy engine is configured to evaluate the collected information and to initiate changes to a configuration of the data storage system based on data storage wear rate and work load distribution policies. The at least one configuration manager is configured to implement the changes initiated by the at least one policy engine to control the data storage wear rate and a skew of the work load distribution within the data storage system.

A method for managing a data storage system includes, at one or more servers, collecting resource utilization information including data storage wear rate of data storage system data storage modules and work load distribution within the data storage system, evaluating the collected information, and initiating changes to a configuration of the data storage system based on wear rate and work load distribution policies that specify one of a maximum data storage wear rate that depends on the work load distribution and a maximum skew for the work load distribution that depends on the data storage wear rate. The method further includes implementing the initiated changes to a configuration of the data storage system to control the data storage wear rate and skew of the work load distribution.

While example embodiments in accordance with the invention are illustrated and disclosed, such disclosure should not be construed to limit the invention. It is anticipated that various modifications and alternative designs may be made without departing from the scope of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an embodiment of a storage system.

FIG. 2 is a block diagram depicting data flow in another embodiment of a storage system.

FIG. 3 is a block diagram depicting data flow in an embodiment of a global management system.

DETAILED DESCRIPTION

The performance of storage arrays may be limited by several factors including the mechanical latencies of magnetic disk drives, the cost and volatility of semiconductor memories, and centralized architectures with inherent limitations to scaling of performance, capacity and interconnect.

Non-volatile semiconductor memory technology, e.g., NAND Flash, may supplant magnetic disk in applications where performance and power take precedence over raw capacity and cost per bit. Existing implementations of this technology have mimicked those of magnetic disk by using the same command sets, same communications protocols, and often the same physical dimensions as magnetic drives. Certain embodiments disclosed herein, however, spring from the premise that the benefits of Flash technology (e.g., no mechanical latency and parallel access) and the mitigation of its disadvantages (e.g., erase before write and write cycle wear-out limitations) may not be achieved with partitioning and interconnect schemes designed for magnetic disk.

Existing high performance storage platforms rely on centralized, shared control and semiconductor cache. This design choice mitigates the latencies of magnetic disk and the low bandwidths of interconnect technologies available at the time their architectures were defined. This design choice, however, may result in a compromise between the cost of an entry level system and the maximum size and performance that may be achieved.

By making use of NAND Flash and modern low latency/high bandwidth interconnect standards within the context of a high-performance computer and suitable file system technology, some embodiments disclosed herein may have the potential for industry leading small-block input/output operations per second (IOPS), total IOPS per storage system, cost per IOPS, floor space per IOPS, and power per IOPS.

Many transaction processing and database systems generate small-block, random data access requests at a very high rate. This rate grows as the semiconductor technology of the processor running these applications improves. Disk capacities and instantaneous data rates continue to increase, but the mechanical seek and rotational latency delays remain relatively constant, leading to a slow growth in the number of disk IOPS. This may create the need for employing ever larger and more expensive disk systems. A system, however, that provides a higher access density than can be achieved with disk drives may be a suitable alternative. Flash memory supports a higher I/O rate per GB of storage (or a higher access density) than disk. Flash memory may also provide lower cost per GB of storage than RAM technologies. Certain embodiments disclosed herein may capitalize on these attributes and thus may deliver a scalable and cost effective storage system that provides the enterprise class reliability, availability and serviceability desired.

Traditional disk drives may be replaced with Solid-State Disks (SSDs) to take advantage of Flash memory technology. This may yield an improvement in I/O rate. These high-speed SSDs, however, may expose system bottlenecks that are currently hidden by the low I/O rates of disk drives. The disk command protocol stack overhead in the controller and the disk command/status processing time in the SSD may dwarf the data transfer time for small-block transfers. Disk command protocols may also hide data usage information from the Flash controller, and hide wear out and failure information from the control unit.

Disk access hardware and disk drive interconnect may be replaced with Flash memory hardware that is located within the disk controller, near the controller's cache. High performance designs may put the Flash memory within the controller. This may limit the amount of Flash storage. Regardless of the location of the Flash memory, traditional disk management processes do not appear to address the unique requirements of Flash memory. With static mapping of host disk addresses to real “disk drives” (Flash storage), hot spots in the host's write access patterns may lead to premature wear out of the associated Flash storage. Small random writes to Flash storage may force the Flash controller to move pages of data around to free an entire Flash block, the smallest portion of the Flash memory that can be erased. This may impact performance and reduce lifetime.

Certain embodiments disclosed herein may avoid at least some of the system bottlenecks of current disk based architectures, and produce a scalable storage system whose performance and value approach the limits of the underlying Flash technology. These embodiments may include:

-   (i) Multiple independent ZFS instances, each of which may be responsible for the management of a portion of the overall capacity. ZFS instances may be spread across a pool of servers that also contain the host interface ports.
-   (ii) Redirection of each I/O request from the receiving port to the ZFS instance(s) responsible for the blocks requested. In an example case, this is a redirection from the host port on one I/O server to a ZFS instance on another I/O server. This redirection stage may allow any part of the capacity to be reached from any port. The ZFS instance may then issue the necessary direct transactions to Flash and/or non-volatile RAM (NVRAM) to complete the request. Acknowledgements or data may then be forwarded back to the host through the originating port.
-   (iii) A low latency, memory mapped network that may tie together, for example, front end ports, ZFS instances, NVRAM and Flash. This network may be implemented with InfiniBand among servers, and between servers and storage units, and with PCI Express internal to the I/O servers and storage units. The servers and storage units may communicate as peers. The redirection traffic and ZFS/memory traffic may both use the same fabric.
-   (iv) One or more storage units may contain, for example, both Flash and NVRAM. These storage units may be designed for high availability with hot swapping and internal redundancy of memory cards, power, cooling and interconnect. An InfiniBand external interconnect may be translated to two independent PCI Express trees by two concentrator boards. The system may thus have rapid access to Flash as well as the NVRAM. The RAM may be made non-volatile by backing it up to dedicated Flash on loss of power. The mix of Flash and NVRAM cards may be configurable; both may use the same connector and board profile.
-   (v) A global management facility (data storage system manager) that may supervise the operation of the storage system in a pseudo-static, “low touch” approach, intervening when capacity must be reallocated between ZFS instances, for global Flash wear leveling, for configuration changes, and for failure recovery.

The “divide and conquer” strategy of dividing the capacity among individual ZFS instances may enable a high degree of scalability of performance, connectivity and capacity. Additional performance may be achieved by horizontally adding more servers and then assigning less capacity per ZFS instance and/or fewer ZFS instances per server. Performance may also be scaled vertically by choosing faster servers. Host ports may be added by filling available slots in servers and then adding additional servers. Additional capacity may be achieved by adding additional storage units, and allocating the new capacity to ZFS instances.

Referring now to FIG. 1, a storage system 10 may include a plurality of I/O servers 12 n (12 a, 12 b, etc.), e.g., blade or standalone servers, a plurality of switch units 14 n (14 a, 14 b, etc.), e.g., InfiniBand expandable switch units, and one or more storage units 16 n (16 a, 16 b, etc.). Other suitable configurations are also possible. An external interface provider 18 n (18 a, 18 b, etc.), data storage controller 20 n (20 a, 20 b, etc.), and global management system (data storage system manager) 21 n (21 a, 21 b, etc.) may be deployed on each of the servers 12 n. (The providers 18 n and the controllers 20 n may, of course, be implemented in hardware and/or software.)

The storage units 16 n of FIG. 1 may include, for example, a plurality of Flash boards 22 n (22 a, 22 b, etc.) and NVRAM boards 24 n (24 a, 24 b) connected to concentrator boards 26 n (26 a, 26 b) via, for example, PCI Express. Each of the storage units 16 n may be an integral rack-mounted unit with its own internally redundant power supply and cooling system. Active components such as memory boards, concentrator boards, power supplies and cooling may be hot swappable.

In the embodiment of FIG. 1, the providers 18 n, controllers 20 n and boards 22 n, 24 n may communicate as peers via a Remote Direct Memory Access (RDMA) protocol conveyed through the switch units 14 n. An I/O server, for example, I/O server 12 a, may communicate with the boards 22 n, 24 n using this RDMA protocol. In addition, each of the I/O servers 12 n may communicate with all of the other I/O servers 12 n using the RDMA protocol. Any suitable communication scheme, however, may be used.

The providers 18 n are each capable of receiving (read or write) data access requests, identifying, via a mapping for example, the controller 20 x that can service the request (which may be deployed on a different server), and routing the request to the identified controller 20 x.
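A minimal sketch of this routing step, assuming a simple address-range table (the table contents and controller names are invented for illustration and are not part of the disclosure):

    # Hypothetical provider-side routing: map the block address of a
    # request to the controller responsible for it.
    ROUTING_MAP = [
        ((0x00000000, 0x0FFFFFFF), "controller-20a"),
        ((0x10000000, 0x1FFFFFFF), "controller-20b"),
    ]

    def route(block_address: int) -> str:
        """Return the controller that exclusively manages this block range."""
        for (start, end), controller in ROUTING_MAP:
            if start <= block_address <= end:
                return controller
        raise KeyError("no controller owns block %#x" % block_address)

    print(route(0x12345678))   # -> controller-20b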

The controllers 20 n each exclusively manage a portion of data content in at least one of the boards 22 n, 24 n, and may satisfy data access requests received from the providers 18 n by accessing their data content.

Referring now to FIG. 2, where elements having like numerals have similar descriptions to FIG. 1, the external interface provider 118 a may receive a data access request from a host. The provider 118 a may identify, via a mapping, etc., the controller 120 x (which, in this example, is deployed on the server 112 b) that can satisfy the request. The provider 118 a may then redirect the request to the identified controller 120 b. The identified controller 120 b may receive the redirected request, and then access, in response, data in any of the boards 122 n, 124 n that may satisfy the request. (Although the use of Flash memory for data storage modules has been discussed in detail, other memory technologies, including magnetic disk, may also be used.)

In certain embodiments:

-   (i) Two or more servers may provide fault tolerance. If one server fails, another server may take over the work that was being performed by the failed server. Likewise, two or more data storage modules may provide fault tolerance.
-   (ii) NVRAM boards may provide access to data more quickly (lower access latency) than Flash memory boards, though at the cost of providing lower capacity (and thus higher cost per bit of storage).
-   (iii) Servers may provide one or more of the following services: host interfacing, routing data access requests to the appropriate server, handling data access requests, managing data stored in a set of data storage modules and/or non-volatile memory cards, managing the assignment of data and storage space to individual servers, and migrating data from one server to another for global wear leveling, to better balance the workload, or to handle system configuration changes, component wear-out or component failure and replacement.
-   (iv) The servers and storage (e.g., data storage modules and/or non-volatile memory cards) may communicate with each other over an interconnect fabric that allows server-to-server, server-to-storage, and/or storage-to-storage communications.
-   (v) The process of routing all I/O access requests for a particular block of data to the server that is responsible for storing that block of data may provide a consistent view of the stored data across all of the host interfaces without requiring server-to-server synchronization on each host I/O request.
-   (vi) The data storage module interfaces may be divided into separate sets, with each set used by only one server. This partitioning of the interfaces may eliminate the need for two or more servers to share a single interface and may eliminate the need for the servers to coordinate their use of a single interface. A single data storage module may provide one interface or multiple interfaces.
-   (vii) The data storage in each non-volatile memory card may be subdivided into separate regions with a single server having exclusive use of a region. This may allow multiple servers to store data within a single non-volatile memory card without having to coordinate their use of the storage space provided by the non-volatile memory card.
-   (viii) The process of managing data stored on the data storage modules and/or non-volatile memory cards may include repositioning updated data that is written to the data storage modules and maintaining mapping information that identifies the location of the most recently written instance of each data block (a minimal sketch of this mapping follows this list). This process may distribute write activity across the storage space within each data storage module and distribute write activity across multiple data storage modules. This distribution of the write activity may provide wear leveling, which may be important for Flash memory devices in which data storage cells wear out after being erased and programmed repeatedly.
-   (ix) The interconnect fabric may employ a memory block transfer protocol, e.g., InfiniBand and/or PCI Express, rather than a disk command protocol. This may be beneficial for accesses to the non-volatile memory cards because it eliminates the overhead of the disk command protocol stack both in the server and in the non-volatile memory card. The data storage modules may use an RDMA protocol to make efficient use of the memory block transfer protocol interconnect and to give the data storage modules the flexibility of scheduling data transfer activities in an order that best suits the needs of the storage medium.
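A minimal sketch of the mapping idea in item (viii), with invented names and no attempt to model real Flash geometry:

    # Append new copies of written blocks and keep a map from each logical
    # block to the most recently written physical location.
    class WearLevelingStore:
        def __init__(self):
            self.log = []        # physical log of written blocks
            self.map = {}        # logical block -> index of newest copy

        def write(self, logical_block: int, data: bytes) -> None:
            # Always append; never rewrite a physical location in place.
            self.log.append(data)
            self.map[logical_block] = len(self.log) - 1

        def read(self, logical_block: int) -> bytes:
            # The map always points at the most recently written instance.
            return self.log[self.map[logical_block]]

    store = WearLevelingStore()
    store.write(7, b"old")
    store.write(7, b"new")        # rewrite lands in a fresh physical slot
    assert store.read(7) == b"new"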

Flash-based storage may address the need for high access density by eliminating the mechanical delays of disk and by keeping a large number of Flash chips active at the same time. This may yield dramatically higher I/O rates per GB of storage compared to disk. Accessing Flash storage over a low-latency interconnect fabric using a low-overhead memory access protocol, and providing multiple Flash controllers per Flash storage unit, may maximize the number of Flash chips being accessed in parallel, yielding even higher access density than high-capacity SSDs.

Coordinating the use of shared resources may add overhead to the process of carrying out an I/O request. Overhead that may be small in relation to disk access times may become quite significant when compared to Flash access times. In systems that share access to a central cache, to various data paths, and to disk interfaces, coordinating the use of these and other shared resources may require locking and serialized execution. This may add delay to critical timing paths in the controller, and these delays may increase as the size of the system increases and the level of resource utilization goes up. To eliminate these coordination delays from the critical timing paths, certain controller architectures disclosed herein may use multiple, independent servers running independent instances of ZFS. The system's storage is spread across these ZFS instances so that each ZFS instance has sole responsibility for managing its cache and its assigned portion of the Flash storage. The servers may then route each (host) access request to the appropriate instance of ZFS, and route the resulting data and completion status from this ZFS instance back to the originating host port. This assignment of storage and workload to ZFS instances may remain static over the time period of, for example, millions of I/O operations. Occasionally, the global management service running on the servers may move data and workload from one ZFS instance to another when necessary, for example, for wear leveling or load balancing.

The servers may need to be able to quickly commit small-block write traffic to storage that is fault-tolerant and that preserves data across a power failure. The NVRAM modules in the Flash storage units may appear to the servers to be word-writable memory. Mirrored writes to these NVRAM modules may provide the fastest means of committing small-block write transactions so that the controllers can report to the host that the write operation has been completed.

After many program/erase cycles, Flash memory cells may wear out, losing their ability to retain the data stored in the cell. To prevent premature wear out due to frequently writing portions of the Flash memory, embodiments of the storage system may employ a wear leveling strategy. This may require dynamic mapping of host disk addresses to Flash memory locations. (A single, large map, for example, that translates disk addresses to Flash memory pages may not scale well.) These embodiments may, for example, employ three wear leveling strategies that work at different levels in the Flash management hierarchy. As an example, each ZFS instance may use COW so that it can write complete RAID stripes, which automatically balances write traffic over the regions of Flash storage owned by that ZFS instance. Closer to the Flash chips, each Flash memory controller may perform wear leveling for the Flash chips that it controls, managing its own small mapping table. If certain embodiments were to allow a long-term imbalance in the write activity handled by the instances of ZFS, the Flash storage owned by the most active ZFS instances may wear out too soon. To prevent this, the system's global management services may migrate frequently written data from one instance of ZFS to another, leading to global write balancing. The performance and scalability advantages of using separate pools of Flash storage managed by multiple, independent ZFS instances may outweigh the occasional overhead of data migration.
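One possible (hypothetical) way a global management service might decide that write activity has become imbalanced enough to justify migration; the threshold and data structures are assumptions, not taken from the disclosure:

    # If one file system instance absorbs far more write traffic than the
    # others, plan to move its hottest data to the coolest instance.
    def plan_migration(writes_per_instance: dict, imbalance_ratio: float = 2.0):
        """Return a (source, destination) pair when one instance's write load
        exceeds the mean by the given ratio, else None."""
        mean = sum(writes_per_instance.values()) / len(writes_per_instance)
        hottest = max(writes_per_instance, key=writes_per_instance.get)
        coolest = min(writes_per_instance, key=writes_per_instance.get)
        if writes_per_instance[hottest] > imbalance_ratio * mean:
            return hottest, coolest
        return None

    plan = plan_migration({"zfs-0": 900, "zfs-1": 120, "zfs-2": 80})
    print(plan)   # ('zfs-0', 'zfs-2') -> migrate hot data off zfs-0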

Flash memory cells may be subject to soft errors. Flash chips may provide extra cells in which the Flash controller stores an error correction code that remedies the most frequent error cases. Each ZFS instance may spread its data across multiple Flash storage boards so that it can recover data that is lost due to an uncorrectable error, the failure of a Flash chip, or the replacement of an entire Flash storage board. To allow many instances of ZFS to run in parallel, each ZFS instance may access and be responsible for managing only a portion of the storage in any one Flash storage board. This may lead to the requirement that each Flash storage board provide multiple independent access controls so that the ZFS instances do not have to spend time coordinating their accesses.

As discussed above, the write activity generated by hosts attached to certain embodiments of the data storage system may be unevenly distributed across all of the data held by the storage system. Some portions of the data assigned to one instance of ZFS may be written far more frequently than the rest of the data managed by the other ZFS instances. This may lead to premature wear out of the Flash storage being managed by that ZFS instance.

Wear leveling may be generally practiced within SSDs. In data storage systems that employ SSDs, this wear leveling may not address imbalances in write activity between SSDs, especially when the storage system uses a static mapping of host data addresses to SSD addresses or when the SSDs are managed by independent file systems. This may lead to some SSDs in a large storage system wearing out long before other SSDs in the system.

In distributed data storage systems such as Lustre, the assignment of a portion of the host address space to a computer within the data storage system may be done statically. There are no provisions for moving data from one computer to another without disrupting the host's ability to access the data. In certain embodiments disclosed herein, the global management system may gather usage information such as data storage wear rate, skew of the work load distribution (the degree to which the load distribution is balanced: the lower the skew, the more uniform the load balancing, and the higher the skew, the less uniform the load balancing), write activity, and remaining lifetime from the independent file systems within the data storage system. It may then use this information to determine when to move data from one file system to another and what data to move. It may also use this information to dynamically adjust a configuration of the data storage system to control the data storage wear rate and skew of the work load distribution.
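The disclosure does not fix a formula for the skew of the work load distribution; one plausible metric is the coefficient of variation of per-module write rates, sketched below for illustration:

    from statistics import mean, pstdev

    def workload_skew(write_rates: list) -> float:
        """0.0 means perfectly uniform load; larger values mean more skew."""
        avg = mean(write_rates)
        return pstdev(write_rates) / avg if avg else 0.0

    print(workload_skew([100, 100, 100, 100]))   # 0.0, perfectly balanced
    print(workload_skew([400, 50, 25, 25]))      # >1, heavily skewed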

Embodiments of the global management system may determine, for example, the initial placement of data when the system configuration is first defined. It may then respond to configuration changes, either due to adding or removing storage system components, or due to the failure and replacement of components, by assigning new data storage to the file systems and/or storage units, removing unused data storage from the file systems and/or storage units, or moving data from one file system and/or storage unit to another (or within a storage unit). Some embodiments of the global management system may be implemented as software and/or hardware that executes on one or more of the servers/computers within the data storage system.

Embodiments of the global management system may carry out one or more of the following activities:

-   divide the host-access address space of the storage system into partitions and assign partitions to individual file system instances;
-   divide the data storage space within the system into partitions and assign partitions to individual file systems so as to yield good system performance, and to limit the performance impact and risk of data loss due to failures of the data storage devices or other failures (within the data storage system or external to the data storage system);
-   distribute a map describing which portions of the system's address space are assigned to each file system instance so that data access requests can be routed to the appropriate file system instance;
-   inform each file system of the portions of the data storage space within the system that can be used by that file system instance;
-   identify data that should be moved from one file system instance to another in order to improve wear leveling or to balance the workload and reduce bottlenecks;
-   update the distributed map to describe the mapping of data being moved from one file system to another;
-   receive notification of data storage system configuration changes due to adding or removing system components (system upgrades), or due to the failure and replacement of components;
-   identify data that should be moved from one file system to another, and any appropriate changes to the assignment of physical storage to file system instances;
-   direct the file system instances to move data and adjust physical storage allocation to adapt to the configuration change, and update the distributed map as appropriate;
-   determine when the storage system should be serviced to replace storage components that have failed, have reached their end of life, or are approaching end-of-life; and
-   when machine service is required for other reasons, report those storage components that are approaching end-of-life, so that they can be replaced during the same service call, eliminating the cost of a subsequent service call.
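As an illustration of the first two activities, the following sketch divides a hypothetical host address space into equal partitions and assigns them round-robin to file system instances; the partition count and instance names are invented:

    def build_address_map(total_blocks: int, instances: list, partitions: int):
        """Return a list of (start_block, end_block, instance) entries."""
        size = total_blocks // partitions
        table = []
        for i in range(partitions):
            start = i * size
            end = total_blocks - 1 if i == partitions - 1 else start + size - 1
            table.append((start, end, instances[i % len(instances)]))
        return table

    # Distribute a one-million-block address space over three instances.
    for entry in build_address_map(1_000_000, ["zfs-0", "zfs-1", "zfs-2"], 6):
        print(entry)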

Referring now to FIGS. 1 and 3, an embodiment of the global management system 21 a may include a data collector 28 a, policy engine 30 a, and configuration manager 32 a. The data collector 28 a, policy engine 30 a and configuration manager 32 a may be implemented in hardware/firmware/software/etc. or any combination thereof. The other global management systems 21 b, etc. of FIG. 1 may be configured similarly to the global management system 21 a. In certain embodiments, however, the policy engines 30 b, etc. of these other global management systems 21 b, etc. may be inactive as explained below.

Each of the respective data collectors 28 n collects, in a known fashion, resource utilization information associated with the activities of the server 12 n on which it is deployed. The data collector 28 a may collect, for example, data storage wear rate, skew of the work load distribution, write activity, remaining lifetime, etc. This resource utilization information may be forwarded to (or requested by) one of the policy engines 30 n.

In the embodiment of FIGS. 1 and 3, the policy engine 30 a is active and serves as a master policy engine while the other policy engines 30 n (e.g., 30 b, etc.) are inactive. As a result, utilization information collected by the data collector 28 a is forwarded to the policy engine 30 a; utilization information collected by the data collector 28 b is also forwarded to the policy engine 30 a, etc. This scheme permits a single policy engine 30 a to initiate configuration changes for the data storage system 10. Known techniques may be used for electing a single master policy engine 30 a and distributing the results of such an election to the data collectors 28 n that will be communicating with the elected master policy engine 30 a. In other embodiments, utilization information collected by the data collector 28 b may be forwarded to the data collector 28 a, which may then forward it to the master policy engine 30 a. Other scenarios are also possible.

The policy engine 30 a may specify several wear rate and work load distribution policies as discussed above. An example policy may specify migrating data from one of the flash boards 22 n (e.g., the flash board 22 a) to another one or more of the flash boards 22 n (e.g., flash boards 22 b and 22 c, flash boards in another storage unit, etc.) to enable removal of the flash board 22 a. Another example policy may specify a maximum wear rate for the flash boards 22 n that depends on a skew (or uniformity) of the work load distribution: the maximum wear rate may increase/decrease as the skew of the work load distribution increases/decreases. Yet another example policy may specify a maximum skew of the work load distribution that depends on the wear rate of the flash boards 22 n: the maximum skew of the work load distribution may increase/decrease as the wear rate of the flash boards 22 n increases/decreases. These example policies (or policies, for example, directed to the storage units, etc.) may allow the global management system 21 a to sacrifice wear rate to improve the uniformity of the work load distribution, or sacrifice uniformity of the work load distribution to improve wear rate. If, for example, the work load distribution is generally uniform and the wear rate distribution across the flash boards 22 n is non-uniform to the extent that prolonged operation with the anticipated workload and wear rate distributions may lead to premature wear out of the flash boards 22 n, the global management system 21 a may initiate a configuration change to cause the work load distribution to become less uniform so as to improve wear rate, etc. Conversely, if the wear rate distribution is generally uniform and the work load distribution is considerably non-uniform, the global management system 21 a may allow the wear rate to become more non-uniformly distributed so as to improve the uniformity of the work load distribution.
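A hedged sketch of the second example policy above, in which the maximum permitted wear rate rises with the skew of the work load distribution; the linear form and the constants are assumptions, since the disclosure only states that the limit depends on the skew:

    # More skew in the work load buys a higher tolerated wear rate
    # (linear relationship and constants are illustrative only).
    def max_allowed_wear_rate(skew: float,
                              base_rate: float = 1.0,
                              slope: float = 0.5) -> float:
        return base_rate + slope * skew

    def policy_violated(observed_wear_rate: float, skew: float) -> bool:
        """True when the policy engine should initiate a configuration change."""
        return observed_wear_rate > max_allowed_wear_rate(skew)

    print(policy_violated(observed_wear_rate=1.2, skew=0.1))   # True, act
    print(policy_violated(observed_wear_rate=1.2, skew=0.8))   # False, tolerate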

In some embodiments, policy engine 30 a may track historical trends in the data collected by the data collectors 28 n and initiate a configuration change when justified by the magnitude of improvement in the data storage system 10 operation that the policy engine 30 a anticipates will result from the configuration change. The policy engine 30 a may balance the conflicting goals of, on the one hand, achieving a desired wear rate or workload distribution and, on the other hand, minimizing the impact to the data storage system 10 operations due to carrying out a configuration change.

Based on an evaluation of the collected information from the data collectors 28 n and the current configuration of the data storage system 10, and in light of the policies in force, the policy engine 30 a may initiate configuration changes for the data storage system 10. When configuration changes apply to or affect portions of the data storage system 10 managed by one of the servers 12 n, for example 12 a, the configuration change requests may be directed to the corresponding configuration manager 32 a.

The configuration managers 32 n may implement the configuration changes initiated by the policy engine 30 a so as to control both the wear rate and work load distribution within the data storage system 10. For example, a configuration change may specify moving some frequently-written data from flash card 22 a to flash cards 22 b and 22 c to reduce the wear rate of flash card 22 a.
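A purely illustrative sketch of what a configuration change request from the policy engine to a configuration manager might look like; the request format and identifiers are invented:

    # Hypothetical change request as a configuration manager might receive it.
    change_request = {
        "action": "migrate",
        "data_set": "hot-extent-42",        # invented identifier
        "source": "flash-22a",
        "destinations": ["flash-22b", "flash-22c"],
        "reason": "reduce wear rate of flash-22a",
    }

    def apply_change(request: dict) -> None:
        """Carry out the migration initiated by the policy engine."""
        for dest in request["destinations"]:
            print("copying %s from %s to %s"
                  % (request["data_set"], request["source"], dest))
        print("updating routing map; retiring copies on %s" % request["source"])

    apply_change(change_request)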

While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. For example, while certain embodiments described herein were discussed within the context of ZFS, other embodiments may be implemented in different contexts such as a log-structured file system, a dynamically mapped data management system, etc. The words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention.

1. A data storage system manager comprising: one or more servers including (I) at least one data collector configured to collect resource utilization information including data storage wear rate of data storage system data storage modules and a skew of work load distribution within the data storage system, (II) at least one policy engine configured to evaluate the collected information and initiate changes to a configuration of the data storage system based on wear rate and work load distribution policies that specify one of (i) a maximum data storage wear rate that depends on the skew of the work load distribution and (ii) a maximum skew for the work load distribution that depends on the data storage wear rate, and (III) at least one configuration manager configured to implement the changes initiated by the at least one policy engine to control the data storage wear rate and skew of the work load distribution.

2. The data storage system manager of claim 1 wherein the resource utilization information further includes write activity.

3. The data storage system manager of claim 1 wherein the maximum data storage wear rate increases as the skew of the work load distribution increases.

4. The data storage system manager of claim 1 wherein the maximum data storage wear rate decreases as the skew of the work load distribution decreases.

5. The data storage system manager of claim 1 wherein the maximum skew of the work load distribution increases as the data storage wear rate increases.

6. The data storage system manager of claim 1 wherein the maximum skew of the work load distribution decreases as the data storage wear rate decreases.

7. The data storage system manager of claim 1 wherein the initiated changes include migrating data from one of the data storage system data storage modules to at least another one of the data storage system data storage modules.

8. The data storage system manager of claim 1 wherein the data storage wear rate and work load distribution policies further specify migrating data from one of the data storage system data storage modules to at least another one of the data storage system data storage modules to enable removal of the one of the data storage system data storage modules.
9. A data storage system manager comprising: one or more servers; at least one data collector deployed on at least one of the servers and configured to collect resource utilization information including data storage wear rate of data storage system data storage modules; at least one policy engine deployed on at least one of the servers and configured to (i) evaluate the collected information and (ii) initiate changes to a configuration of the data storage system based on data storage wear rate and work load distribution policies; and at least one configuration manager deployed on at least one of the servers and configured to implement the changes initiated by the at least one policy engine to control the data storage wear rate and a skew of the work load distribution within the data storage system.

10. The data storage system manager of claim 9 wherein the resource utilization information further includes at least one of write activity and work load distribution.

11. The data storage system manager of claim 9 wherein the data storage wear rate and work load distribution policies specify a maximum data storage wear rate that depends on the skew of the work load distribution.

12. The data storage system manager of claim 11 wherein the maximum data storage wear rate increases as the skew of the work load distribution increases.

13. The data storage system manager of claim 11 wherein the maximum data storage wear rate decreases as the skew of the work load distribution decreases.

14. The data storage system manager of claim 9 wherein the initiated changes include migrating data from one of the data storage system data storage modules to at least another one of the data storage system data storage modules.

15. The data storage system manager of claim 9 wherein the data storage wear rate and work load distribution policies specify a maximum skew for the work load distribution that depends on the data storage wear rate.

16. The data storage system manager of claim 15 wherein the maximum skew for the work load distribution increases as the data storage wear rate increases.

17. The data storage system manager of claim 15 wherein the maximum skew for the work load distribution decreases as the data storage wear rate decreases.

18. The data storage system manager of claim 9 wherein the data storage wear rate and work load distribution policies specify migrating data from one of the data storage system data storage modules to at least another one of the data storage system data storage modules to enable removal of the one of the data storage system data storage modules.

19. A method for managing a data storage system, the method comprising: at one or more servers, collecting resource utilization information including data storage wear rate of data storage system data storage modules and work load distribution within the data storage system; evaluating the collected information; initiating changes to a configuration of the data storage system based on wear rate and work load distribution policies that specify one of (i) a maximum data storage wear rate that depends on the work load distribution and (ii) a maximum skew for the work load distribution that depends on the data storage wear rate; and implementing the initiated changes to a configuration of the data storage system to control the data storage wear rate and skew of the work load distribution.

20. The method of claim 19 wherein the resource utilization information further includes write activity.